Preliminaries

First, we'll load some required libraries and set some global options


In [1]:
import pandas as pd          # our core data analysis toolkit
import numpy as np           
import pylab as pl           # plotting libraries
from ggplot import *
from pandas import read_json # a function for reading data in JSON files

# This option shows our plots directly in IPython Notebooks
%matplotlib inline 

# This option gives a more pleasing visual style to our plots
pd.set_option('display.mpl_style', 'default')

# The location of our playtest data file
filepath = "2014-05-13 makescape playtest.json"

Data Analysis

Loading and Cleaning

Below we'll load the data and sort by increasing timestamp.


In [2]:
def loadDataSortedByTimestamp(filepath):
    x = read_json(filepath)
    x = x.sort(columns='timestamp')
    x.index = range(0, len(x))
    return(x)

ms = loadDataSortedByTimestamp(filepath)

Now that our data is loaded into the variable ms (I chose the name as an abbreviation of MakeScape), let's look at it and make sure it's sane. One of the first things I'll do is check the list of columns that our data comes with.


In [3]:
ms.columns


Out[3]:
Index([u'_id', u'ada_base_types', u'adage_version', u'application_name', u'application_version', u'board_mode', u'component_list', u'created_at', u'deviceInfo', u'fish', u'fish_list', u'game', u'game_id', u'key', u'mode_name', u'num_batteries', u'num_leds', u'num_resistors', u'num_timers', u'player_name', u'player_names', u'playspace_id', u'playspace_ids', u'reason', u'resistance', u'session_token', u'timed_out', u'timestamp', u'updated_at', u'user_id', u'virtual_context', u'visability_mode', u'voltage'], dtype='object')

Whoa! That is a lot of columns. 33 columns, to be exact. We can check that by calling Python's built-in function for determining the length of a collection:


In [4]:
len(ms.columns)


Out[4]:
33

But we should also check how many rows (in this case, how many distinct events) we have in our dataset.


In [5]:
len(ms) # returns 8505


Out[5]:
8505

In [6]:
columns = ['key', 'timestamp']
ms.head(n=5)[columns]


Out[6]:
key timestamp
0 ADAGEStartSession 1398178271860
1 ADAGEStartSession 1398190767768
2 ADAGEStartSession 1398191616469
3 ADAGEStartSession 1398192512628
4 ADAGEStartSession 1398192887546

5 rows × 2 columns

What's less-than-helpful right now is that those timestamps are just raw integers. We want to make sure those integers actually represent times when data could reasonably have been collected (and not, say, January of the year 47532, which actually happened once).
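
Before converting anything, we can do a quick sanity check that the raw integers at least fall inside a plausible window. This is just a sketch I'm adding here (the bounds below are illustrative, not part of the original analysis):

# Illustrative sanity check: do all raw timestamps fall inside calendar year 2014?
lower = pd.Timestamp('2014-01-01').value // 10**6  # Timestamp.value is nanoseconds; convert to ms
upper = pd.Timestamp('2015-01-01').value // 10**6
print(ms.timestamp.between(lower, upper).all())    # True means no year-47532 surprises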

Thankfully, pandas comes with a function that can convert UNIX Epoch Time integers into human-recognizable dates. In this case, what we'll do is create a new column called human-readable-timestamp by applying the pandas Timestamp() function to our existing integers. Then we'll check the data.


In [7]:
ms['human-readable-timestamp'] = ms.timestamp.apply(lambda x: pd.Timestamp(x, unit='ms'))
columns = ['key', 'timestamp', 'human-readable-timestamp']
ms[columns].head()


Out[7]:
key timestamp human-readable-timestamp
0 ADAGEStartSession 1398178271860 2014-04-22 14:51:11.860000
1 ADAGEStartSession 1398190767768 2014-04-22 18:19:27.768000
2 ADAGEStartSession 1398191616469 2014-04-22 18:33:36.469000
3 ADAGEStartSession 1398192512628 2014-04-22 18:48:32.628000
4 ADAGEStartSession 1398192887546 2014-04-22 18:54:47.546000

5 rows × 3 columns

Visualizing Events Over Time

So, it seems like we have way too many MakeConnectComponent events. Earlier, I explained that we're averaging more than one MakeConnectComponent event per second. But what if we wanted to think about whether that average really describes a typical time slice of our data? In other words, we might want to know how our MakeConnectComponent events are distributed over time.

One way to think about that distribution is to ask: when do our MakeConnectComponent events occur over time? Below, I'm going to use the ggplot package to look at the cumulative distribution of the connection events. The syntax may seem wonky and complicated at first, but it's actually an elegant implementation of ggplot2, itself an implementation of Leland Wilkinson's Grammar of Graphics. If at first you're stymied by it, don't worry. I'll try to help break it down for you.

First, we're doing some basic manipulation to get a cumulative sum column. This is actually so dumb I'm almost embarrassed. I create a column by applying a lambda function to the timestamp column that always returns 1, then I just sum cumulatively over that column. Lastly, I apply another lambda function to convert our timestamp (currently an integer) into a nicely formatted timestamp.

I'll break these plotting lines down one major line at a time:

  1. I create a ggplot object, which is essentially the basic kind of object from which all plots are constructed in ggplot. aes() just stands for aesthetic mapping, where I'm telling ggplot how to map data to graphical features. In this case, I'm saying "map the values in the timestamp1 column to the x position of this plot, and map the values of cumulativeCount to the y position." It may seem trivial now, but the power of aesthetic mappings like this is that I can also map quantities (or categories) to other graphical properties, for example mapping other columns in my dataframe to the aesthetic properties of color or shape. For now, we'll just stick with mapping quantities to Cartesian x and y coordinates. When creating the ggplot object, I also tell it what dataframe I'm talking about, so the local names of timestamp1 and cumulativeCount make sense in scope.
  2. What we see now is a common pattern in ggplot-style programming. I can literally add a layer to my plot using the plus operator. Here, I'm telling ggplot that I want it to apply my chosen aesthetic mappings using a line geometry, which means it will connect each discrete datapoint with a line. (An alternative geometry would have been a simple point geometry, geom_point(), which would give us a bivariate scatterplot instead of a lineplot.)
  3. The remaining lines add special options to my plot using the same compositional syntax of the plus operator. Here, I'm using special functions ggtitle(), xlab(), and ylab() to set the text for the plot title and axis labels.
  4. Next I just use a simple print() call to make sure my plot shows up in my interactive session.
  5. Finally, I use a convenience function called ggsave() to save my plot directly to a file. ggsave() is smart and it detects the desired type of output file based on the suffix you pass in as a filename. In this case I'm using the .PNG format for images, but if I wanted an infinitely scalable image I could have used a .PDF extension.

In [8]:
# Manipulating the data to get a cumulative sum
# and nicely formatted timestamps
connectionEvents = ms[ms.key == 'MakeConnectComponent']
connectionEvents['cumulativeCount'] = connectionEvents.timestamp.apply(lambda x: 1).cumsum()
connectionEvents['timestamp1'] = connectionEvents.timestamp.apply(lambda x: pd.Timestamp(x, unit='ms')) 

# Creating the basic plot
p = ggplot(aes(x='timestamp1',
               y='cumulativeCount'),
           data=connectionEvents)
p = p + geom_line()
p = p + ggtitle('Cumulative Distribution of MakeConnectComponent Events')
p = p + xlab('Time')
p = p + ylab('Event Count')

# Showing the plot
print(p)

# Saving the plot
ggsave(plot=p,
       filename='cumulativeDistributionOfMakeConnectComponent1.png')


<ggplot: (295352553)>
Saving 11.0 x 8.0 in image.

This plot has a number of interesting features. Easily the most salient feature is the giant flat-line in the center. It looks like there were effectively no MakeConnectComponent events between 2000 hrs on the first day and 1400 hrs on the second day. And that seems entirely reasonable: the game likely would have been shut off during the night, then fired up again for testing the next day.

The challenge is that on either side of the flatline the curves are quite steep. That means there could be a fair amount of information hiding in the areas of the plot where there was activity, but it's up to us to extract out that long, boring middle portion where nothing is happening. How can we do that?

Well, suppose what we wanted was to create two new plots, call them Day 1 and Day 2, that have just the interesting parts: the parts where the slope of the cumulative distribution is nonzero. (Note that when the slope is nonzero, that's when action is happening and we're registering events over time.) If we wanted two separate plots, all we have to do is figure out when, exactly, that boring stretch of time starts and when, exactly, it ends. And, one way to do that would be to compute the time that elapses between each successive event in our dataset. So let's do that.

Time Deltas

We're going to compute the time difference between successive events, which I'm calling time deltas. It's worth taking a second to think about how the delta is defined:

For the $i$th event, the first-order delta $\Delta_{1,i}$ is given by the simple equation below, where $t_i$ is the timestamp of the $i$th event:

$\Delta_{1,i} = t_i - t_{i-1}$

As a result, the very first event $(i = 0)$ in a series will have a diff value of NaN or NaT (Not a Time), because the -1st event is undefined. But the second event will have a diff value of (Time of Second event - Time of First event). The very last event will also have a value: (Time of Last event - Time of Penultimate event).

The handy thing about pandas is that every data series (and a column counts as a data series) has a diff() method, which does exactly what we want: it computes successive pairwise differences between events. If we apply the .diff() method to our timestamps, we'll have exactly what we want: a column of numbers where each number represents the time elapsed since the event that came before.
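
To see what diff() does, here's a toy example (the numbers below are made up purely for illustration):

toy = pd.Series([100, 103, 103, 110])
print(toy.diff())  # NaN, 3, 0, 7 -- the first element has no predecessor, so its diff is undefined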


In [9]:
connectionEvents['delta1'] = connectionEvents.timestamp1.diff()

And, now that we have our column of deltas, how can we figure out where the big boring part starts? Well, the big boring part lasts a long time: multiple hours. So, what we're looking for is a huge time delta. Say a delta of more than five hours.


In [10]:
connectionEvents[connectionEvents.delta1 > np.timedelta64(5, 'h')]['timestamp']


Out[10]:
4443    1400077592798
Name: timestamp, dtype: int64

In [11]:
whenBoringPartEnds = 1400077592798 
day1 = connectionEvents[connectionEvents.timestamp < whenBoringPartEnds]
day2 = connectionEvents[connectionEvents.timestamp >= whenBoringPartEnds]

# Creating the day1 plot
p = ggplot(aes(x='timestamp1',
               y='cumulativeCount'),
           data=day1)
p = p + geom_line()
p = p + ggtitle('Cumulative Distribution of MakeConnectComponent Events\nDay1')
p = p + xlab('Time')
p = p + ylab('Event Count')

# Showing the plot
print(p)

# Saving the plot
ggsave(plot=p,
       filename='cumulativeDistributionDay1.png')

# Creating the day2 plot
p = ggplot(aes(x='timestamp1',
               y='cumulativeCount'),
           data=day2)
p = p + geom_line()
p = p + ggtitle('Cumulative Distribution of MakeConnectComponent Events\nDay2')
p = p + xlab('Time')
p = p + ylab('Event Count')

# Showing the plot
print(p)

# Saving the plot
ggsave(plot=p,
       filename='cumulativeDistributionDay2.png')


<ggplot: (283063801)>
Saving 11.0 x 8.0 in image.
<ggplot: (295742165)>
Saving 11.0 x 8.0 in image.

Higher Order Deltas

We can also compute second- and third-order deltas to explore the successive rates at which time deltas are changing:

$\Delta_{2,i} = \Delta_{1,i} - \Delta_{1,i-1}$

$\Delta_{3,i} = \Delta_{2,i} - \Delta_{2,i-1}$

Looking for zeroes

For each subsequent event where the timestamp is the same, we get a zero delta. So, if four successive events all share the same timestamp, they culminate in a fourth event whose delta3 is zero.

Let's do a little inspection and find events whose first-, second-, and third-order deltas are all zero. We'll use the pandas method diff(), which takes an array of data and computes the difference between contiguous pairwise elements.

Our code below computes successive deltas (up to fourth-order), and assigns each array of deltas to its own column in the dataframe.


In [12]:
ms['delta1'] = ms.timestamp.diff()
ms['delta2'] = ms.delta1.diff()
ms['delta3'] = ms.delta2.diff()
ms['delta4'] = ms.delta3.diff()

# A boolean expression to select events where deltas 1–3 are all zero
thirdOrderZeroes = (ms.delta3 == 0) & (ms.delta2 == 0) & (ms.delta1 == 0) 

# The columns we'll want to view
columns = ['key', 'timestamp', 'delta1', 'delta2', 'delta3', 'delta4']

ms[thirdOrderZeroes][columns]


Out[12]:
key timestamp delta1 delta2 delta3 delta4
2482 MakeDisconnectComponent 1400006485499 0 0 0 -2102
2506 MakeDisconnectComponent 1400006510313 0 0 0 -1886
2526 MakeDisconnectComponent 1400006521978 0 0 0 -3534
2572 MakeDisconnectComponent 1400006562692 0 0 0 -4151
2623 MakeDisconnectComponent 1400006622320 0 0 0 -1
2646 MakeDisconnectComponent 1400006634386 0 0 0 -685
2647 MakeDisconnectComponent 1400006634386 0 0 0 0
2672 MakeDisconnectComponent 1400006645003 0 0 0 -1
2958 MakeDisconnectComponent 1400006866421 0 0 0 -2335
3018 MakeDisconnectComponent 1400006896668 0 0 0 -2534
3200 MakeDisconnectComponent 1400006993361 0 0 0 -1301
3351 MakeDisconnectComponent 1400007068806 0 0 0 -850
3480 MakeDisconnectComponent 1400007157316 0 0 0 -781
3512 MakeDisconnectComponent 1400007196213 0 0 0 -1518
3570 MakeCaptureFish 1400007234810 0 0 0 -51
3667 MakeDisconnectComponent 1400007308155 0 0 0 -3251
3785 MakeSnapshot 1400007410096 0 0 0 -1783
4199 MakeDisconnectComponent 1400007951843 0 0 0 -768
7955 MakeCircuitCreated 1400083772689 0 0 0 -1300
8022 MakeDisconnectComponent 1400083848640 0 0 0 -785
8088 MakeCaptureFish 1400083893390 0 0 0 -152
8123 MakeCaptureFish 1400083924105 0 0 0 -1085

22 rows × 6 columns

Inspecting chains of subsequent events

Event 2482 had a delta3 of zero. Let's look in its neighborhood (namely the 15 events that bracket it) and see what the timestamps of the preceding/subsequent events were.


In [13]:
ms[2470:2485][columns]


Out[13]:
key timestamp delta1 delta2 delta3 delta4
2470 MakeConnectComponent 1400006472299 0 -401 -307 -490
2471 MakeConnectComponent 1400006472730 431 431 832 1139
2472 MakeCircuitCreated 1400006472730 0 -431 -862 -1694
2473 MakeConnectComponent 1400006472730 0 0 431 1293
2474 MakeSnapshot 1400006473382 652 652 652 221
2475 MakeSpawnFish 1400006473483 101 -551 -1203 -1855
2476 MakeSpawnFish 1400006473484 1 -100 451 1654
2477 MakeSnapshot 1400006478397 4913 4912 5012 4561
2478 MakeSnapshot 1400006483397 5000 87 -4825 -9837
2479 MakeCaptureFish 1400006485499 2102 -2898 -2985 1840
2480 MakeDisconnectComponent 1400006485499 0 -2102 796 3781
2481 MakeDisconnectComponent 1400006485499 0 0 2102 1306
2482 MakeDisconnectComponent 1400006485499 0 0 0 -2102
2483 MakeDisconnectComponent 1400006485514 15 15 15 15
2484 MakeDisconnectComponent 1400006485514 0 -15 -30 -45

15 rows × 6 columns

MakeCaptureFish events are followed by MakeDisconnectComponent events that almost all share the same timestamp

It looks like event 2482 is part of a sequence: a fish was captured at 2479, and (because when a fish gets captured it blows out a circuit) a number of components got disconnected in events 2480-2484.

Note that not all the disconnect events share the same timestamp.
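
To get a sense of how widespread this timestamp-sharing is, one quick check (a sketch I'm adding here, not one of the original cells) is to count how many disconnect events pile up on each individual timestamp:

disconnects = ms[ms.key == 'MakeDisconnectComponent']
clusterSizes = disconnects.groupby('timestamp').size()
print(clusterSizes.value_counts())  # how often exactly 1, 2, 3, ... disconnects share a single timestamp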

Viewing a histogram of time deltas

I mentioned in the last section that there's a huge problem with our data. Let's take a look again at the counts of different types of events, but this time we'll focus on the top five most frequent events:


In [14]:
topFiveMostFrequentEvents = ms.groupby('key').count().sort(columns=['timestamp'], ascending=False)[:5]
topFiveMostFrequentEvents['timestamp']


Out[14]:
key
MakeConnectComponent       2609
MakeSnapshot               2086
MakeDisconnectComponent     943
MakeAddComponent            891
MakeRemoveComponent         840
Name: timestamp, dtype: int64

In our implementation of events, MakeConnectComponent should be triggered once the system detects that two circuit elements have become connected (say, when a player bumps a resistor and a battery together). MakeDisconnectComponent should be triggered once the system detects that two connected elements have become disconnected (say, when a player swipes a finger across a wire to cut it.)

That leads to Problem 1

Problem 1 - There are more connect events than there are disconnect events

And not just more, way more. Almost three times more. If you think about it, this doesn't make sense at all.

In our game, each block has a lone positive terminal and a lone negative terminal. And, for simplicity, each terminal accepts a maximum of one connection. When players add blocks to the table, the blocks start out not connected to anything. The first MakeConnectComponent event should happen when two free terminals from two different blocks get bumped together. And, if two terminals are connected, they can't get connected to other things without getting disconnected first. So, we should expect at least something close to parity: there should be about as many disconnect events as there are connect events. Otherwise, how can a bunch of connected terminals keep connecting to other things? (Again, remember that each terminal should accept a maximum of one connection.)
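
We can put a number on that mismatch straight from the event counts (the figures below match the counts table later in this notebook):

counts = ms.key.value_counts()
print(counts['MakeConnectComponent'])     # 2609
print(counts['MakeDisconnectComponent'])  # 943
print(float(counts['MakeConnectComponent']) / counts['MakeDisconnectComponent'])  # roughly 2.8 connects per disconnect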

So, that's pretty weird. And also possibly very bad. But, it gets even worse when you consider Problem 2.

Problem 2 - There are more connect events than there are snapshot events

In our game, we built an event that takes a snapshot of the entire state of the board at regular intervals. We did that because we knew not all actions in the game should generate events. For example, if we recorded every single time any block changed its position, that would be way too much data. On the other hand, we need to know when players move blocks even if those movements don't generate big game events (like completing a circuit). So, we compromised. Every second, the game stores a snapshot of information about the state of the board, and that event is called MakeSnapshot.

If you look at the table above (or the bar chart in our previous section), you'll notice that MakeSnapshot comes in second in our Top 5 Most Frequent Events. And, it's not a small margin, either. MakeConnectComponent is beating it by 25%. That is a Very Weird Thing.

If we assume that the snapshots are reliably firing every second, that means that, on average, the system is registering block-to-block connections more than once per second. If all that data were user-generated, that means that over a period of about 44 total minutes of gameplay, kids were connecting blocks at a rate of more than one connection per second, every second, for the entirety of those 44 minutes.
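
As a back-of-the-envelope check (assuming the snapshots really do fire about once per second), we can estimate the connection rate directly:

counts = ms.key.value_counts()
print(float(counts['MakeConnectComponent']) / counts['MakeSnapshot'])  # roughly 1.25 connections per snapshot-second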

To put that in perspective one more way:

  • if you played The Lion Sleeps Tonight by The Tokens
  • on repeat,
  • 20 consecutive times,
  • and you snapped your fingers to the beat (60bpm) the entire time
  • each snap would represent two circuit elements being connected

Rather than assuming children are connecting circuit elements at a ludicrously high rate (Problem 2), which also would seem physically impossible (Problem 1), it seems more likely that there's a problem with the game's data logging that's revealed by our data. So, let's go to the next section and explore a method for checking it out.


In [15]:
topFive = loadDataSortedByTimestamp(filepath)
topFiveMostFrequentEvents = list(ms.groupby('key').count().sort(columns=['timestamp']).index)[-5:]
frequencyFilter = topFive.key.apply(lambda x: x in topFiveMostFrequentEvents)
topFive = topFive[frequencyFilter]
topFive['delta1'] = topFive.timestamp.diff()
binBreaks = [-1, 1, 50, 100, 200, 300, 500, 1000]
# binBreaks = [1000, 2000, 3000, 4000, 5000]

p = ggplot(aes(x='delta1',
               fill='key'),
           data=topFive) + \
        geom_histogram(breaks=binBreaks) + \
        scale_x_continuous(breaks=binBreaks) + \
        ggtitle('Distribution of Time Deltas Between Successive Events') + \
        ylab('Number of Events') + \
        xlab('Time Between Events (ms)')
# print(p)
# ggsave(p, "histogram.png")
# topFive.head(n = 20)[['timestamp', 'delta1', 'key']]
print(p)


<ggplot: (302503801)>

So, we know the distribution of MakeConnectComponent events is uneven. We know that because in the last section we plotted the cumulative distribution function and saw it had a widely variable slope. What I'd like to do now is get an idea of just what that distribution of elapsed event times looks like.

To do that, we're going to use a visualization called a kernel density estimate. Essentially, what we're doing is creating a smoothed empirical approximation of what the distribution of time deltas looks like for each kind of event.

We're also going to use another powerful feature of graphical analysis: what statistician [Bill Cleveland](http://cm.bell-labs.com/cm/ms/departments/sia/wsc/) and Edward Tufte call "small multiples." We're actually going to take a look at the distributions of time deltas for the top 5 most frequent kinds of events and graphically compare them.


In [16]:
topFive = loadDataSortedByTimestamp(filepath)
topFiveMostFrequentEvents = list(ms.groupby('key').count().sort(columns=['timestamp']).index)[-5:]
frequencyFilter = topFive.key.apply(lambda x: x in topFiveMostFrequentEvents)
topFive['delta1'] = topFive.timestamp.diff()
topFive = topFive[frequencyFilter]


p = ggplot(aes(x = 'delta1', 
               group='key'), 
           data=topFive)
p = p + geom_density() # a Kernel Density Estimate
p = p + scale_x_continuous(limits=[-1000, 20000])
p = p + facet_wrap(y='key', 
                   ncol=1, 
                   scales='fixed')
p = p + xlab('Time Between Successive Events (ms)')
p = p + ggtitle('Smoothed Kernel Density Estimates')

print(p)
ggsave(plot=p,
       filename='kernelDensityEstimate.png')


<ggplot: (302503917)>
Saving 11.0 x 8.0 in image.


In [17]:
connections = loadDataSortedByTimestamp(filepath)
connections = connections[connections.key == 'MakeConnectComponent']
connections['delta1'] = connections.timestamp.diff()

binBreaks = [0, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
p = ggplot(aes(x='delta1',
               fill='key'),
           data=connections) + \
        geom_histogram(breaks=binBreaks) + \
        scale_x_continuous(breaks=binBreaks) + \
        ggtitle('Distribution of Time Deltas Between Successive Events') + \
        ylab('Number of Events') + \
        xlab('Time Between Events (ms)')
print(p)
ggsave(plot=p, 
       filename='histogram2.png')


<ggplot: (301594873)>
Saving 11.0 x 8.0 in image.

In [18]:
topFive[topFive.key == 'MakeDisconnectComponent'][['key', 'delta1']].head()
print(topFive[topFive.key == 'MakeDisconnectComponent']['delta1'].describe())
topFive[topFive.key == 'MakeDisconnectComponent']['delta1'].plot(kind='kde')

# print(p)
# ggsave(p, "histogram.png")
# topFive.head(n = 20)[['timestamp', 'delta1', 'key']]


count     943.000000
mean      516.357370
std       981.573755
min         0.000000
25%         0.000000
50%        16.000000
75%       566.500000
max      9866.000000
Name: delta1, dtype: float64
Out[18]:
<matplotlib.axes.AxesSubplot at 0x11a37f050>

In [19]:
connects = loadDataSortedByTimestamp(filepath)
connects = connects[connects.key == 'MakeConnectComponent']
connects['delta1'] = connects.timestamp.diff()
p = ggplot(aes(x='delta1',
               fill='key'),
           data=connects) + \
        geom_histogram(breaks=binBreaks) + \
        scale_x_continuous(breaks=binBreaks) + \
        ggtitle('Distribution of Time Deltas Between Successive Events') + \
        ylab('Number of Events') + \
        xlab('Time Between Events (ms)')
        
print(p)


<ggplot: (299732397)>

In [20]:
len(connections[connections.delta1 <= 1000]) # 1744 events


Out[20]:
1744

In [21]:
columns = ['timestamp', 'key']
ms.groupby('key').count().sort(columns=['timestamp'], ascending=False)[columns]


Out[21]:
timestamp key
key
MakeConnectComponent 2609 2609
MakeSnapshot 2086 2086
MakeDisconnectComponent 943 943
MakeAddComponent 891 891
MakeRemoveComponent 840 840
MakeCircuitCreated 229 229
MakeResetBoard 211 211
MakeSpawnFish 161 161
MakeCaptureFish 136 136
MakeEndGame 100 100
MakeSummonBoard 96 96
MakeStartGame 92 92
MakeModeChange 45 45
MakeVisabilityChange 43 43
ADAGEStartSession 23 23

15 rows × 2 columns


In [22]:
# We can also see what this looks like as a plot
msdata = ms.groupby('key').count().sort(columns=['timestamp', 'key'], ascending=False)
p = msdata['timestamp'].plot(kind='bar')
print(p)
pl.savefig("barChart.jpg", 
           dpi=300, 
           figsize=(8, 11),
           bbox_inches='tight')


Axes(0.125,0.125;0.775x0.775)

How many components are in the component lists of disconnect events

From our meeting on 2014-05-30, Allison confirmed that MakeConnectComponent always generates a component_list of the two blocks being connected.

Matthew also said that every time a connection is established between blocks A and B, it generates two MakeConnectComponent events (a claim we can quickly check just after this list):

  • A connecting to B and
  • B connecting to A
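
If that double-logging really happens, we'd expect roughly half of all MakeConnectComponent events to land on exactly the same timestamp as the connect event just before them. A quick check of that claim might look like this (a sketch, not one of the original cells):

connects = ms[ms.key == 'MakeConnectComponent']
zeroGap = (connects.timestamp.diff() == 0).sum()
print(zeroGap)        # connect events that share a timestamp with the previous connect event
print(len(connects))  # total connect events, for comparison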

So, I tried investigating whether all MakeDisconnectComponent events generate a component_list with just two components.

But, my first query below failed, because it seems some events have null values for component_list:

(ms[ms.key == 'MakeDisconnectComponent']['component_list']).apply(lambda x: len(x))

So, let's try another tactic. First, we'll create a filter that returns null values for component_lists.


In [23]:
nullComponentLists = pd.isnull(ms['component_list'])
ms[nullComponentLists][ms.key == 'MakeDisconnectComponent'][['key', 'component_list']]


/Users/briandanielak/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py:1686: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[23]:
key component_list
980 MakeDisconnectComponent None
1154 MakeDisconnectComponent None
1252 MakeDisconnectComponent None
1274 MakeDisconnectComponent None
2958 MakeDisconnectComponent None
3018 MakeDisconnectComponent None
3200 MakeDisconnectComponent None
3262 MakeDisconnectComponent None
3304 MakeDisconnectComponent None
4473 MakeDisconnectComponent None
7228 MakeDisconnectComponent None

11 rows × 2 columns
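
With the null rows identified, we can also answer the original question for the remaining events by dropping the nulls before measuring list lengths. A minimal sketch (I'm not showing its output here):

disconnectLists = ms[ms.key == 'MakeDisconnectComponent']['component_list'].dropna()
print(disconnectLists.apply(len).value_counts())  # how many disconnect events carry 1, 2, 3, ... components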

Event 980 is the first disconnect event with no component_list, so let's check out what's happening in the neighborhood of that event.


In [24]:
ms[970:986][['key', 'timestamp']]


Out[24]:
key timestamp
970 MakeConnectComponent 1400003097166
971 MakeDisconnectComponent 1400003098417
972 MakeCaptureFish 1400003098417
973 MakeDisconnectComponent 1400003098433
974 MakeConnectComponent 1400003098966
975 MakeCircuitCreated 1400003098966
976 MakeConnectComponent 1400003098966
977 MakeConnectComponent 1400003099232
978 MakeSpawnFish 1400003099634
979 MakeSnapshot 1400003100748
980 MakeDisconnectComponent 1400003101299
981 MakeCaptureFish 1400003101299
982 MakeDisconnectComponent 1400003101299
983 MakeDisconnectComponent 1400003101314
984 MakeSnapshot 1400003105748
985 MakeConnectComponent 1400003107948

16 rows × 2 columns

So, it looks like events 980, 981, and 982 all share the exact same timestamp (down to the millisecond), with one final disconnection event happening only about 15ms later.


In [25]:
ms[980:984][['key', 'timestamp', 'delta1']]


Out[25]:
key timestamp delta1
980 MakeDisconnectComponent 1400003101299 551
981 MakeCaptureFish 1400003101299 0
982 MakeDisconnectComponent 1400003101299 0
983 MakeDisconnectComponent 1400003101314 15

4 rows × 3 columns

We can try to interrogate what's happening by looking at the component lists. First, let's look inside the circuit that was created at event 975. I'm calling list() on it because I kind of don't understand how else to get pandas to get me the output format I need to inspect :-)


In [26]:
list(ms[975:978].component_list)


Out[26]:
[[{u'current_flowing': True,
   u'marker_id': 2,
   u'negative_terminal': 19,
   u'positive_terminal': 27,
   u'theta': 30.2151107788086,
   u'type': u'BATTERY',
   u'x': 376.921844482422,
   u'y': 292.583557128906},
  {u'current_flowing': True,
   u'marker_id': 27,
   u'negative_terminal': 2,
   u'positive_terminal': 19,
   u'theta': 14.361967086792,
   u'type': u'LED',
   u'x': 505.863616943359,
   u'y': 337.244506835938},
  {u'current_flowing': True,
   u'marker_id': 19,
   u'negative_terminal': 27,
   u'positive_terminal': 2,
   u'theta': 262.492034912109,
   u'type': u'RESISTOR',
   u'x': 712.927917480469,
   u'y': 163.924194335938}],
 [{u'current_flowing': True,
   u'marker_id': 2,
   u'negative_terminal': 19,
   u'positive_terminal': 27,
   u'theta': 30.2151107788086,
   u'type': u'BATTERY',
   u'x': 376.921844482422,
   u'y': 292.583557128906},
  {u'current_flowing': True,
   u'marker_id': 27,
   u'negative_terminal': 2,
   u'positive_terminal': 19,
   u'theta': 14.361967086792,
   u'type': u'LED',
   u'x': 505.863616943359,
   u'y': 337.244506835938}],
 [{u'current_flowing': True,
   u'marker_id': 27,
   u'negative_terminal': 2,
   u'positive_terminal': 19,
   u'theta': 14.1408853530884,
   u'type': u'LED',
   u'x': 509.764862060547,
   u'y': 342.452362060547},
  {u'current_flowing': True,
   u'marker_id': 2,
   u'negative_terminal': 19,
   u'positive_terminal': 27,
   u'theta': 16.4315128326416,
   u'type': u'BATTERY',
   u'x': 409.187103271484,
   u'y': 336.581115722656}]]

So, the circuit is a 3-block circuit. It has:

  • A battery (block id 2)
  • A resistor (block id 19)
  • An LED (block id 27)

After a fish hits that circuit, it fries all the virtual wires (because the fish are bioelectric), so we should expect to see 3 disconnect events:

  • Disconnecting the battery and resistor (2-19)
  • Disconnecting the battery and the LED (2-27)
  • Disconnecting the resistor and the LED (19-27)

In [27]:
list(ms[980:984].component_list)


Out[27]:
[None,
 [{u'current_flowing': True,
   u'marker_id': 28,
   u'negative_terminal': 31,
   u'positive_terminal': 6,
   u'theta': 222.557052612305,
   u'type': u'LED',
   u'x': 284.760711669922,
   u'y': -138.952453613281}],
 [{u'current_flowing': True,
   u'marker_id': 6,
   u'negative_terminal': -1,
   u'positive_terminal': 31,
   u'theta': 277.101806640625,
   u'type': u'BATTERY',
   u'x': 271.590301513672,
   u'y': -393.121185302734}],
 [{u'current_flowing': True,
   u'marker_id': 6,
   u'negative_terminal': -1,
   u'positive_terminal': 31,
   u'theta': 277.101806640625,
   u'type': u'BATTERY',
   u'x': 271.590301513672,
   u'y': -393.123443603516}]]

In [28]:
ms[980:984].timestamp


Out[28]:
980    1400003101299
981    1400003101299
982    1400003101299
983    1400003101314
Name: timestamp, dtype: int64
